Graph-Based Semi-Supervised Learning
Graph-based methods are a powerful family of semi-supervised learning algorithms that represent data as a graph: nodes are data points and edges encode pairwise similarities. These methods leverage the graph structure to propagate labels from labeled nodes to unlabeled nodes.
Core Concept
The fundamental idea behind graph-based semi-supervised learning is:
- Data points that are connected in the graph are likely to have the same label
- Labels can "flow" through the graph from labeled to unlabeled points
- The overall structure of the data can be inferred from both labeled and unlabeled points
Graph Construction
The first step in graph-based methods is creating a similarity graph:
Common Graph Construction Methods
- ε-neighborhood Graph
  - Connect points if their distance is less than a threshold ε
  - Results in an unweighted graph
- k-Nearest Neighbors (k-NN) Graph
  - Connect each point to its k nearest neighbors
  - Can be directed (not symmetric) or undirected (symmetrized)
- Fully Connected Graph with Weighted Edges
  - Connect all points with weighted edges
  - Weight typically defined by a kernel function:
    - Gaussian kernel: $w_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
    - Polynomial kernel: $w_{ij} = (x_i^\top x_j + c)^d$
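For small datasets, the three constructions can be sketched directly in NumPy. This is an illustrative sketch: the function names are mine, and practical implementations use sparse matrices and spatial indexes (e.g. k-d trees) rather than dense pairwise distances.

```python
import numpy as np

def gaussian_weights(X, sigma=1.0):
    """Fully connected graph: w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def epsilon_graph(X, eps):
    """Unweighted eps-neighborhood graph: edge iff distance < eps."""
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    A = (dists < eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

def knn_graph(X, k):
    """k-NN graph, symmetrized into an undirected adjacency matrix."""
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(dists, np.inf)   # a point is not its own neighbor
    A = np.zeros_like(dists)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, neighbors.ravel()] = 1.0
    return np.maximum(A, A.T)         # undirected: keep edge if either side has it
```

Note that the k-NN relation is not symmetric; `np.maximum(A, A.T)` symmetrizes it by keeping an edge whenever either endpoint selects the other.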
Label Propagation Algorithms
1. Standard Label Propagation
The basic algorithm:
- Initialize a label distribution for every node (a one-hot vector for each labeled node; zeros or a uniform distribution for each unlabeled node)
- Propagate labels through the graph iteratively
- At each step, nodes receive contributions from their neighbors
- Labeled nodes maintain their original labels (clamping)
- Iterate until convergence
Mathematically:
- Let $Y_L$ be the labels of the labeled points
- Let $Y_U$ be the labels of the unlabeled points (to be determined)
- Let $W$ be the weight/similarity matrix
- Let $D$ be the diagonal degree matrix where $D_{ii} = \sum_j W_{ij}$

The update rule:

$$Y^{(t+1)} = D^{-1} W \, Y^{(t)}$$

followed by clamping the labeled rows back to $Y_L$ after each step.
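The clamped iteration (propagate through the degree-normalized similarity matrix, then reset the labeled rows) can be sketched as follows. The function name, the convention `-1` for "unlabeled", and the dense NumPy representation are illustrative assumptions:

```python
import numpy as np

def label_propagation(W, y, n_classes, max_iter=1000, tol=1e-6):
    """Iterative label propagation with clamping.

    W : (n, n) symmetric similarity matrix
    y : (n,) integer labels, with -1 marking unlabeled points
    """
    n = len(y)
    d_inv = 1.0 / np.maximum(W.sum(axis=1), 1e-12)  # guard isolated nodes
    labeled = y >= 0
    # One-hot rows for labeled points, uniform rows for unlabeled points
    F = np.full((n, n_classes), 1.0 / n_classes)
    F[labeled] = np.eye(n_classes)[y[labeled]]
    for _ in range(max_iter):
        F_new = d_inv[:, None] * (W @ F)            # Y <- D^{-1} W Y
        F_new[labeled] = np.eye(n_classes)[y[labeled]]  # clamp labeled rows
        if np.abs(F_new - F).max() < tol:
            F = F_new
            break
        F = F_new
    return F.argmax(axis=1)
```

Clamping is what distinguishes this from plain diffusion: the labeled points act as fixed sources, so label mass keeps flowing outward from them until the unlabeled distributions stabilize.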
2. Label Spreading
A variation of label propagation that introduces a regularization parameter α:
- Balances between initial label assignment and propagated values
- Typically uses normalized graph Laplacian
The update rule:

$$F^{(t+1)} = \alpha S F^{(t)} + (1 - \alpha) Y$$

Where:
- $S = D^{-1/2} W D^{-1/2}$ is the normalized similarity matrix
- $Y$ is the initial label assignment
- $\alpha \in (0, 1)$ is the propagation weight
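Label spreading replaces hard clamping with a soft pull back toward the initial labels, weighted by $1-\alpha$. A minimal sketch (names and dense matrices again illustrative; note there is no clamping step here):

```python
import numpy as np

def label_spreading(W, y, n_classes, alpha=0.8, max_iter=1000, tol=1e-6):
    """F(t+1) = alpha * S @ F(t) + (1 - alpha) * Y,
    with S = D^{-1/2} W D^{-1/2} the symmetrically normalized similarity matrix."""
    d = np.maximum(W.sum(axis=1), 1e-12)
    S = W / np.sqrt(np.outer(d, d))       # S_ij = W_ij / sqrt(d_i * d_j)
    n = len(y)
    labeled = y >= 0
    Y = np.zeros((n, n_classes))          # initial assignment: one-hot or all-zero
    Y[labeled] = np.eye(n_classes)[y[labeled]]
    F = Y.copy()
    for _ in range(max_iter):
        F_new = alpha * (S @ F) + (1.0 - alpha) * Y
        if np.abs(F_new - F).max() < tol:
            F = F_new
            break
        F = F_new
    return F.argmax(axis=1)
```

Because the spectral radius of $\alpha S$ is at most $\alpha < 1$, the iteration is a contraction and converges geometrically; smaller $\alpha$ keeps predictions closer to the initial labels, larger $\alpha$ trusts the graph more.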
3. Consistency Method (Harmonic Function)
Finds a classification function that is:
- Consistent with the labeled data
- Smooth across the entire graph
Minimizes:

$$E(f) = \frac{1}{2} \sum_{i,j} w_{ij} \left( f(x_i) - f(x_j) \right)^2$$

Subject to $f(x_i) = y_i$ for labeled points.
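This constrained minimization has a closed-form solution: partitioning the Laplacian $L = D - W$ into labeled/unlabeled blocks and setting the gradient on the unlabeled rows to zero gives $f_U = L_{UU}^{-1} W_{UL} f_L$. A sketch under the assumption of a binary problem with real-valued labels (function name is illustrative):

```python
import numpy as np

def harmonic_solution(W, y_l, labeled_idx):
    """Closed-form harmonic function: f_U = L_UU^{-1} W_UL f_L.

    W           : (n, n) symmetric similarity matrix
    y_l         : real-valued labels of the labeled points (e.g. 0/1)
    labeled_idx : indices of the labeled points
    """
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W                      # unnormalized Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]      # unlabeled block
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]        # unlabeled-to-labeled weights
    f_u = np.linalg.solve(L_uu, W_ul @ y_l)             # harmonic values
    return f_u, unlabeled_idx
```

The solution is "harmonic" because each unlabeled value equals the weighted average of its neighbors' values, which is exactly the smoothness condition the objective encodes.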
Graph Laplacian
The graph Laplacian matrix is central to many graph-based methods:
- Unnormalized Laplacian: $L = D - W$
- Normalized Laplacian: $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
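Both Laplacians are a few lines of NumPy (function name illustrative; the small clamp on degrees guards against isolated nodes):

```python
import numpy as np

def laplacians(W):
    """Return the unnormalized (L = D - W) and symmetrically normalized
    (L_sym = I - D^{-1/2} W D^{-1/2}) graph Laplacians."""
    d = W.sum(axis=1)                         # node degrees
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return L, L_sym
```

Useful sanity checks: each row of $L$ sums to zero (the constant vector is in its null space), and both matrices are symmetric and positive semidefinite for a symmetric $W$ with nonnegative weights.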